*Dataset available at " https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers " .
A manager at the bank is disturbed with more and more customers leaving their credit card services. They would really
appreciate if one could predict for them who is gonna get churned so they can proactively go to the customer to provide
them better services and turn customers' decisions in the opposite direction
Now, this dataset consists of 10,000 customers mentioning their age, salary, marital_status, credit card limit, credit
card category, etc. There are nearly 18 features.
We have only 16.07% of customers who have churned. Thus, it's a bit difficult to train our model to predict churning
customers.
Attrition_Flag -> Internal event (customer activity) variable - if the account is closed then 1 else 0 - We gotta do this change!!!
Marital_Status -> Demographic variable - Married, Single, Divorced, Unknown
Income_Category -> Demographic variable - Annual Income Category of the account holder (< $40K, $40K - 60K, $60K - $80K, $80K-$120K, > $120K, Unknown)
Months_on_book -> Period of relationship with bank
Total_Relationship_Count -> Total no. of products held by the customer
Months_Inactive_12_mon -> No. of months inactive in the last 12 months
Contacts_Count_12_mon -> No. of Contacts in the last 12 months
Total_Revolving_Bal -> Total Revolving Balance on the Credit Card
Avg_Open_To_Buy -> Open to Buy Credit Line (Average of last 12 months)
Total_Amt_Chng_Q4_Q1 -> Change in Transaction Amount (Q4 over Q1)
Total_Trans_Amt -> Total Transaction Amount (Last 12 months)
Total_Trans_Ct -> Total Transaction Count (Last 12 months)
Total_Ct_Chng_Q4_Q1 -> Change in Transaction Count (Q4 over Q1)
Avg_Utilization_Ratio -> Average Card Utilization Ratio
import pandas as pd
import numpy as np
df_raw = pd.read_csv("BankChurners.csv")
df1 = df_raw.iloc[: , :-2].drop("CLIENTNUM", axis=1)
#advise from "dataset uploader" to delete the last two columns - "naive bayes class..."
df1 = df1.replace(['Existing Customer'], 0) #replacing labels
df1 = df1.replace(['Attrited Customer'], 1)
display(df1)
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 0 | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 0 | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 0 | 40 | F | 4 | High School | Unknown | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 0 | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10122 | 0 | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | 3 | 2 | 3 | 4003.0 | 1851 | 2152.0 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 1 | 41 | M | 2 | Unknown | Divorced | $40K - $60K | Blue | 25 | 4 | 2 | 3 | 4277.0 | 2186 | 2091.0 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 1 | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | 5 | 3 | 4 | 5409.0 | 0 | 5409.0 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 1 | 30 | M | 2 | Graduate | Unknown | $40K - $60K | Blue | 36 | 4 | 3 | 3 | 5281.0 | 0 | 5281.0 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 1 | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | 6 | 2 | 4 | 10388.0 | 1961 | 8427.0 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
10127 rows × 20 columns
df1.info()
#here I observe i may have to treat Income_Category data type / values.
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Attrition_Flag 10127 non-null int64 1 Customer_Age 10127 non-null int64 2 Gender 10127 non-null object 3 Dependent_count 10127 non-null int64 4 Education_Level 10127 non-null object 5 Marital_Status 10127 non-null object 6 Income_Category 10127 non-null object 7 Card_Category 10127 non-null object 8 Months_on_book 10127 non-null int64 9 Total_Relationship_Count 10127 non-null int64 10 Months_Inactive_12_mon 10127 non-null int64 11 Contacts_Count_12_mon 10127 non-null int64 12 Credit_Limit 10127 non-null float64 13 Total_Revolving_Bal 10127 non-null int64 14 Avg_Open_To_Buy 10127 non-null float64 15 Total_Amt_Chng_Q4_Q1 10127 non-null float64 16 Total_Trans_Amt 10127 non-null int64 17 Total_Trans_Ct 10127 non-null int64 18 Total_Ct_Chng_Q4_Q1 10127 non-null float64 19 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(5) memory usage: 1.5+ MB
In this project i'm gonna use the PyCaret (an "auto Machine Learning toolkit") librarie to model the problem solution, thus i'm also gonna use a more automated Exploratory Data Analysis tool - the Pandas Profiling - , just to quickly check all the variables/features characteristics and some other details, like the existing correlations.
Attrition_Flag (our prediction target) has a high correlation with Total_Trans_Ct &
Total_Revolving_Bal variables.
-from the 10127 total reports, only 1627 belong to "Attrited Customer", so, theres a imbalance in our sample
regarding target.
No other problems found in our dataset.
from pandas_profiling import ProfileReport
# generating report with pandas profiling ("auto"-EDA)
profile = ProfileReport(df1, title="Pandas Profiling Report")
profile
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]